Air Quality in Dar es Salaam 🇹🇿

Prepare Data

Connect

Task 3.5.1: Connect to MongoDB server running at host "localhost" on port 27017. Then connect to the "air-quality" database and assign the collection for Dar es Salaam to the variable name dar.

Explore

Task 3.5.2: Determine the numbers assigned to all the sensor sites in the Dar es Salaam collection. Your submission should be a list of integers.

Task 3.5.3: Determine which site in the Dar es Salaam collection has the most sensor readings (of any type, not just PM2.5 readings). You submission readings_per_site should be a list of dictionaries that follows this format:

[{'_id': 6, 'count': 70360}, {'_id': 29, 'count': 131852}]

Note that the values here ☝️ are from the Nairobi collection, so your values will look different.

Import

Task 3.5.4: (5 points) Create a wrangle function that will extract the PM2.5 readings from the site that has the most total readings in the Dar es Salaam collection. Your function should do the following steps:

  1. Localize reading time stamps to the timezone for "Africa/Dar_es_Salaam".
  2. Remove all outlier PM2.5 readings that are above 100.
  3. Resample the data to provide the mean PM2.5 reading for each hour.
  4. Impute any missing values using the forward-will method.
  5. Return a Series y.

Use your wrangle function to query the dar collection and return your cleaned results.

y = wrangle(dar) print(y.head())

Explore Some More

Task 3.5.5: Create a time series plot of the readings in y. Label your x-axis "Date" and your y-axis "PM2.5 Level". Use the title "Dar es Salaam PM2.5 Levels".

Task 3.5.6: Plot the rolling average of the readings in y. Use a window size of 168 (the number of hours in a week). Label your x-axis "Date" and your y-axis "PM2.5 Level". Use the title "Dar es Salaam PM2.5 Levels, 7-Day Rolling Average".

Task 3.5.7: Create an ACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient". Use the title "Dar es Salaam PM2.5 Readings, ACF".

Task 3.5.8: Create an PACF plot for the data in y. Be sure to label the x-axis as "Lag [hours]" and the y-axis as "Correlation Coefficient". Use the title "Dar es Salaam PM2.5 Readings, PACF".

Split

Task 3.5.9: Split y into training and test sets. The first 90% of the data should be in your training set. The remaining 10% should be in the test set.

Build Model

Baseline

Task 3.5.10: Establish the baseline mean absolute error for your model.

Iterate

Task 3.5.11: You're going to use an AR model to predict PM2.5 readings, but which hyperparameter settings will give you the best performance? Use a for loop to train your AR model on using settings for p from 1 to 30. Each time you train a new model, calculate its mean absolute error and append the result to the list maes. Then store your results in the Series mae_series.

Task 3.5.12: Look through the results in mae_series and determine what value for p provides the best performance. Then build and train final_model using the best hyperparameter value.

Note: Make sure that you build and train your model in one line of code, and that the data type of best_model is statsmodels.tsa.ar_model.AutoRegResultsWrapper.

Task 3.5.13: Calculate the training residuals for best_model and assign the result to y_train_resid. Note that your name of your Series should be "residuals".

Task 3.5.14: Create a histogram of y_train_resid. Be sure to label the x-axis as "Residuals" and the y-axis as "Frequency". Use the title "Best Model, Training Residuals".

Task 3.5.15: Create an ACF plot for y_train_resid. Be sure to label the x-axis as "Lag [hours]" and y-axis as "Correlation Coefficient". Use the title "Dar es Salaam, Training Residuals ACF".

Evaluate

Task 3.5.16: Perform walk-forward validation for your model for the entire test set y_test. Store your model's predictions in the Series y_pred_wfv. Make sure the name of your Series is "prediction" and the name of your Series index is "timestamp".

Task 3.5.17: Submit your walk-forward validation predictions to the grader to see test mean absolute error for your model.

Communicate Results

Task 3.5.18: Put the values for y_test and y_pred_wfv into the DataFrame df_pred_test (don't forget the index). Then plot df_pred_test using plotly express. Be sure to label the x-axis as "Date" and the y-axis as "PM2.5 Level". Use the title "Dar es Salaam, WFV Predictions".


Copyright © 2022 WorldQuant University. This content is licensed solely for personal use. Redistribution or publication of this material is strictly prohibited.